Assignment 1 - Probability, Linear Algebra, & Computational Programming

Nick Carroll

Netid: nc230

Note: this assignment falls under collaboration Mode 2: Individual Assignment – Collaboration Permitted. Please refer to the syllabus for additional information.

Instructions for all assignments can be found here and are also linked from the course syllabus.

Total points in the assignment add up to 90; an additional 10 points are allocated to presentation quality.

Learning Objectives

The purpose of this assignment is to provide a refresher on fundamental concepts that we will use throughout this course and to provide an opportunity to develop any of the related skills that may be unfamiliar to you. Through the course of completing this assignment, you will...

  • Refresh your knowledge of probability theory including properties of random variables, probability density functions, cumulative distribution functions, and key statistics such as mean and variance.
  • Revisit common linear algebra and matrix operations and concepts such as matrix multiplication, inner and outer products, inverses, the Hadamard (element-wise) product, eigenvalues and eigenvectors, orthogonality, and symmetry.
  • Practice numerical programming, core to machine learning, by loading and filtering data, plotting data, vectorizing operations, profiling code speed, and debugging and optimizing performance. You will also practice computing probabilities based on simulation.
  • Develop or refresh your knowledge of Git version control, which will be a core tool used in the final project of this course.
  • Apply your skills altogether through an exploratory data analysis to practice data cleaning, data manipulation, interpretation, and communication.

We will build on these concepts throughout the course, so use this assignment as a catalyst to deepen your knowledge and seek help with anything unfamiliar.

If some references would be helpful on these topics, I would recommend the following resources:

  • Mathematics for Machine Learning by Deisenroth, Faisal, and Ong
  • Deep Learning; Part I: Applied Math and Machine Learning Basics by Goodfellow, Bengio, and Courville
  • The Matrix Calculus You Need For Deep Learning by Parr and Howard
  • Dive Into Deep Learning; Appendix: Mathematics for Deep Learning by Werness, Hu, et al.

Note: don't worry if you don't understand everything in the references above; some of these books dive into the minutiae of each of these topics.


Probability and Statistics Theory

Note: for all assignments, write out equations and math using markdown and LaTeX. For this assignment show ALL math work for questions 1-4, meaning that you should include any intermediate steps necessary to understand the logic of your solution.

1

[3 points]
Let $f(x) = \begin{cases} 0 & x < 0 \\ \alpha x^2 & 0 \leq x \leq 2 \\ 0 & 2 < x \end{cases}$

For what value of $\alpha$ is $f(x)$ a valid probability density function?

ANSWER

$ f(x) $ is a valid probability density function if $ \int_{-\infty}^{\infty} f(x) dx = 1 $, therefore $ f(x) $ is a valid probability density function when $ \int_{0}^{2} {\alpha} x^2 dx = 1 $

$$ \int_{0}^{2} {\alpha} x^2 dx = [\frac{1}{3} {\alpha} x^3]_{0}^{2} = \frac{1}{3}[8{\alpha} - 0{\alpha}] = 1 $$
$$ 8{\alpha} = 3 $$

$ f(x) $ is a valid probability density function when: $$ {\alpha} = \frac{3}{8} $$
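
As a quick numerical sanity check of this value, the sketch below approximates the integral of $\frac{3}{8}x^2$ over $[0, 2]$ with a midpoint rule. The number of subintervals `n` is an arbitrary choice made for illustration; the result should be very close to 1.

```python
import numpy as np

alpha = 3 / 8
n = 100_000                    # number of midpoint-rule subintervals (arbitrary choice)
width = 2 / n                  # width of each subinterval of [0, 2]
midpoints = (np.arange(n) + 0.5) * width
area = np.sum(alpha * midpoints**2) * width
print(f"Integral of f(x) over [0, 2] is approximately {area:.6f}")
```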


2

[3 points] What is the cumulative distribution function (CDF) that corresponds to the following probability distribution function? Please state the value of the CDF for all possible values of $x$.

$f(x) = \begin{cases} \frac{1}{3} & 0 < x < 3 \\ 0 & \text{otherwise} \end{cases}$

ANSWER

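The cumulative distribution function is $F(x) = \int_{-\infty}^{x} f(t)\,dt$. For $x \leq 0$ no probability has accumulated, for $0 < x < 3$ the integral is $\int_{0}^{x} \frac{1}{3}\,dt = \frac{1}{3}x$, and for $x \geq 3$ all of the probability has accumulated, so: $$ F(x) = \begin{cases} 0 & x \leq 0 \\ \frac{1}{3}x & 0 < x < 3 \\ 1 & x \geq 3 \end{cases} $$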
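
A CDF accumulates the density from $-\infty$ up to $x$, so for this uniform density it is 0 to the left of the support, rises linearly as $\frac{1}{3}x$ on $(0, 3)$, and equals 1 to the right. A minimal sketch of that piecewise function follows; the name `uniform_cdf` is a hypothetical helper introduced only for illustration.

```python
import numpy as np

def uniform_cdf(x):
    """CDF of the density f(x) = 1/3 on (0, 3): x/3 clipped to the range [0, 1]."""
    return np.clip(np.asarray(x, dtype=float) / 3, 0.0, 1.0)

values = uniform_cdf([-1.0, 0.0, 1.5, 3.0, 5.0])
print(values)
```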


3

[6 points] For the probability distribution function for the random variable $X$,

$f(x) = \begin{cases} \frac{1}{3} & 0 < x < 3 \\ 0 & \text{otherwise} \end{cases}$

what is the (a) expected value and (b) variance of $X$. Show all work.

ANSWER

The expected value of X is: $$ E(x) = \int_{-\infty}^{\infty} xf(x) dx = \int_{0}^{3} \frac{1}{3}x dx = \frac{1}{3}[\frac{1}{2} x^2]_{0}^{3} = \frac{9}{6} = 1.5 $$

The variance of X is: $$ Var(x) = E((X - \mu)^2) = E((X - \mu)(X - \mu)) = E(x^2 - 2x\mu + {\mu}^2) = E(x^2) - 2E(x\mu) + E({\mu}^2) $$ $$ Var(x) = \int_{-\infty}^{\infty} x^2 f(x) dx - \int_{-\infty}^{\infty} 2{\mu}x f(x) dx + \int_{-\infty}^{\infty}{\mu}^2 f(x) dx = \frac{1}{3} (\int_{0}^{3} x^2 dx - \int_{0}^{3} 2{\mu}x dx + \int_{0}^{3}{\mu}^2 dx) $$

$$ Var(x) = \frac{1}{3} ([\frac{1}{3} x^3]_{0}^{3} - [{\mu}x^2]_{0}^{3} + [{\mu}^2 x]_{0}^{3}) = \frac{1}{3} (\frac{27}{3} - \frac{27}{2} + \frac{27}{4}) = \frac{36}{12} - \frac{54}{12} + \frac{27}{12} = \frac{3}{4} = 0.75 $$
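
These two results can be checked by simulation. The sketch below draws samples from a uniform distribution on $(0, 3)$ and compares the sample mean and variance with 1.5 and 0.75; the sample size and the seed are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)             # fixed seed so the run is repeatable
samples = rng.uniform(0.0, 3.0, 1_000_000)
sim_mean = samples.mean()
sim_var = samples.var()
print(f"Simulated mean: {sim_mean:.3f}, simulated variance: {sim_var:.3f}")
```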

4

[6 points] Consider the following table of data that provides the values of a discrete data vector $\mathbf{x}$ of samples from the random variable $X$, where each entry in $\mathbf{x}$ is given as $x_i$.

Table 1. Dataset N=5 observations

$x_0$ $x_1$ $x_2$ $x_3$ $x_4$
$\textbf{x}$ 2 3 10 -1 -1

What is the (a) mean and (b) variance of the data?

Show all work. Your answer should include the definition of mean and variance in the context of discrete data. In this case, use the sample variance since the sample size is quite small

ANSWER

The mean of the data is the sum of the observations divided by the number of observations: $$ \bar{x} = \frac{\sum_{i} x_i}{N} = \frac{2 + 3 + 10 - 1 - 1}{5} = \frac{13}{5} = 2.6 $$

The variance of the data is: $$ Var(x) = \frac{\sum_{i} (x_i - \bar{x})^2}{N-1} = \frac{(2-2.6)^2 + (3-2.6)^2 + (10-2.6)^2 + (-1- 2.6)^2 + (-1-2.6)^2}{5-1} $$ $$ Var(x) = \frac{(-0.6)^2 + 0.4^2 + 7.4^2 + (-3.6)^2 + (-3.6)^2}{4} = \frac{0.36 + 0.16 + 54.76 + 12.96 + 12.96}{4} = 20.3$$

Math Confirmation:

In [ ]:
import numpy as np
x = [2, 3, 10, -1, -1]
print(
    f"The mean of x is {np.mean(x)} and the variance of x is {np.var(x, ddof = 1):.1f}")
The mean of x is 2.6 and the variance of x is 20.3

Linear Algebra

5

[5 points] A common task in machine learning is a change of basis: transforming the representation of our data from one space to another. A prime example of this is through the process of dimensionality reduction as in Principal Components Analysis where we often seek to transform our data from one space (of dimension $n$) to a new space (of dimension $m$) where $m<n$. Assume we have a sample of data of dimension $n=4$ (as shown below) and we want to transform it into a dimension of $m=2$.

$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix}$

(a) What are the dimensions of a matrix, $\mathbf{A}$, that would linearly transform our sample of data, $\mathbf{x}$, into a space of $m=2$ through the operation $\mathbf{Ax}$?

(b) Express this transformation in terms of the components of $\mathbf{x}$: $x_1$, $x_2$, $x_3$, $x_4$ and the matrix $\mathbf{A}$ where each entry in the matrix is denoted as $a_{i,j}$ (e.g. the entry in the first row and second column would be $a_{1,2}$). Your answer will be in the form of a matrix expressing result of the product $\mathbf{Ax}$.

Note: please write your answers here in LaTeX

ANSWER

(a) A matrix $ \mathbf{A} $ that linearly transforms our sample of data, $ \mathbf{x} $, into a space of $ m = 2 $ through the operation $ \mathbf{Ax} $ must have dimensions $ 2 \times 4 $ (2 rows, 4 columns).

(b) The resultant matrix (vector) of the operation $ \mathbf{Ax} $ is: $\mathbf{Ax} = \begin{bmatrix} a_{1,1}x_1 + a_{1,2}x_2 + a_{1,3}x_3 + a_{1,4}x_4 \\ a_{2,1}x_1 + a_{2,2}x_2 + a_{2,3}x_3 + a_{2,4}x_4 \end{bmatrix}$
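
The shape bookkeeping can be sketched in NumPy. The entries of `A` and `x` below are random placeholders, not values from the assignment; the point is that a $2 \times 4$ matrix maps a $4 \times 1$ sample to a $2 \times 1$ result whose rows are the sums $\sum_j a_{i,j} x_j$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 4))   # hypothetical 2 x 4 transformation matrix
x = rng.normal(size=(4, 1))   # one sample stored as a 4 x 1 column vector

projected = A @ x
# Row i of the product is the sum over j of a_{i,j} * x_j
manual = np.array([[sum(A[i, j] * x[j, 0] for j in range(4))] for i in range(2)])
print(projected.shape)
print(np.allclose(projected, manual))
```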


6

[14 points] Matrix manipulations and multiplication. Machine learning involves working with many matrices, so this exercise will provide you with the opportunity to practice those skills.

Let $\mathbf{A} = \begin{bmatrix} 1 & 2 & 3 \\ 2 & 4 & 5 \\ 3 & 5 & 6 \end{bmatrix}$, $\mathbf{b} = \begin{bmatrix} -1 \\ 3 \\ 8 \end{bmatrix}$, $\mathbf{c} = \begin{bmatrix} 4 \\ -3 \\ 6 \end{bmatrix}$, and $\mathbf{I} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$

Compute the following using Python or indicate that it cannot be computed. Refer to NumPy's tools for handling matrices. While all answers should be computed using Python, your response to whether each item can be computed should refer to underlying linear algebra. There may be circumstances when Python will produce an output, but based on the dimensions of the matrices involved, the linear algebra operation is not possible. For the case when an operation is invalid, explain why it is not.

When the quantity can be computed, please provide both the Python code AND the output of that code (this need not be in LaTeX)

  1. $\mathbf{A}\mathbf{A}$
  2. $\mathbf{A}\mathbf{A}^T$
  3. $\mathbf{A}\mathbf{b}$
  4. $\mathbf{A}\mathbf{b}^T$
  5. $\mathbf{b}\mathbf{A}$
  6. $\mathbf{b}^T\mathbf{A}$
  7. $\mathbf{b}\mathbf{b}$
  8. $\mathbf{b}^T\mathbf{b}$
  9. $\mathbf{b}\mathbf{b}^T$
  10. $\mathbf{b} + \mathbf{c}^T$
  11. $\mathbf{b}^T\mathbf{b}^T$
  12. $\mathbf{A}^{-1}\mathbf{b}$
  13. $\mathbf{A}\circ\mathbf{A}$
  14. $\mathbf{b}\circ\mathbf{c}$

Note: The element-wise (or Hadamard) product is the product of each element in one matrix with the corresponding element in another matrix, and is represented by the symbol "$\circ$".

ANSWER

In [ ]:
A = np.array([[1, 2, 3], [2, 4, 5], [3, 5, 6]])
b = np.array([-1, 3, 8])
c = np.array([4, -3, 6])
  1. The value of $ AA $ is:
In [ ]:
np.matmul(A, A)
Out[ ]:
array([[14, 25, 31],
       [25, 45, 56],
       [31, 56, 70]])
  2. The value of $ A{A}^T $ is:
In [ ]:
np.matmul(A, A.T)
Out[ ]:
array([[14, 25, 31],
       [25, 45, 56],
       [31, 56, 70]])
  3. The value of $ Ab $ is:
In [ ]:
np.matmul(A, b)
Out[ ]:
array([29, 50, 60])
  4. $ A{b}^T $ is not a valid operation because $ A $ is a 3 x 3 matrix while $ b^T $ is a 1 x 3 matrix, and matrix multiplication requires the number of columns of the first matrix to equal the number of rows of the second matrix.
  5. $ bA $ is not a valid operation because $ b $ is a 3 x 1 matrix while $ A $ is a 3 x 3 matrix, and matrix multiplication requires the number of columns of the first matrix to equal the number of rows of the second matrix.
  6. The value of $ {b}^{T}A $ is:
In [ ]:
np.matmul(b.T, A)
Out[ ]:
array([29, 50, 60])
  7. $ bb $ is not a valid operation because both operands are 3 x 1 matrices, and matrix multiplication requires the number of columns of the first matrix to equal the number of rows of the second matrix.
  8. The value of $ b^{T}b $ is:
In [ ]:
np.matmul(b.T, b)
Out[ ]:
74
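  9. The value of $ b{b}^T $ is the 3 x 3 outer product, since a 3 x 1 matrix times a 1 x 3 matrix gives a 3 x 3 matrix. Because b is stored as a 1-D array, b.T is identical to b and np.matmul(b, b.T) would return the inner product (74), so np.outer is used instead:
In [ ]:
np.outer(b, b)
Out[ ]:
array([[ 1, -3, -8],
       [-3,  9, 24],
       [-8, 24, 64]])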
  10. $ b + {c}^T $ is not a valid operation because $ b $ is a 3 x 1 matrix while $ c^T $ is a 1 x 3 matrix, and matrix addition requires both matrices to have the same dimensions.
  11. $ b^{T}b^{T} $ is not a valid operation because both operands are 1 x 3 matrices, and matrix multiplication requires the number of columns of the first matrix to equal the number of rows of the second matrix.
  12. The value of $ A^{-1}b $ is:
In [ ]:
np.matmul(np.linalg.inv(A), b)
Out[ ]:
array([ 6.,  4., -5.])
  13. The value of $ A\circ A $ is:
In [ ]:
A * A
Out[ ]:
array([[ 1,  4,  9],
       [ 4, 16, 25],
       [ 9, 25, 36]])
  14. The value of $ b \circ c $ is:
In [ ]:
b * c
Out[ ]:
array([-4, -9, 48])
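
NumPy treats 1-D arrays as neither row nor column vectors, which is why it can return a result for products that are undefined in linear algebra. One way to make the shape rules explicit, shown here as a sketch rather than as the approach used above, is to store $\mathbf{b}$ as a 3 x 1 array (the name `b_col` is introduced only for this example). Invalid products then raise a `ValueError`.

```python
import numpy as np

A = np.array([[1, 2, 3], [2, 4, 5], [3, 5, 6]])
b_col = np.array([[-1], [3], [8]])   # explicit 3 x 1 column vector

inner = b_col.T @ b_col              # (1 x 3)(3 x 1) -> 1 x 1
outer = b_col @ b_col.T              # (3 x 1)(1 x 3) -> 3 x 3
print(inner.shape, outer.shape)

try:
    b_col @ b_col                    # (3 x 1)(3 x 1): inner dimensions do not match
    invalid_raised = False
except ValueError:
    invalid_raised = True
print(f"b b raises ValueError: {invalid_raised}")
```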

7

[8 points] Eigenvectors and eigenvalues. Eigenvectors and eigenvalues are useful for some machine learning algorithms, but the concepts take time to solidly grasp. They are used extensively in machine learning, and in this course we will encounter them in relation to Principal Components Analysis (PCA) and clustering algorithms. For an intuitive review of these concepts, explore this interactive website at Setosa.io. Also, the series of linear algebra videos by Grant Sanderson of 3Blue1Brown are excellent and can be viewed on YouTube here. For these questions, numpy may once again be helpful.

  1. Calculate the eigenvalues and corresponding eigenvectors of matrix $\mathbf{A}$ above, from the last question.
  2. Choose one of the eigenvector/eigenvalue pairs, $\mathbf{v}$ and $\lambda$, and show that $\mathbf{A} \mathbf{v} = \lambda \mathbf{v}$. This relationship extends to higher orders: $\mathbf{A} \mathbf{A} \mathbf{v} = \lambda^2 \mathbf{v}$
  3. Show that the eigenvectors are orthogonal to one another (i.e. their inner product is zero). This is true for eigenvectors from real, symmetric matrices. In three dimensions or less, this means that the eigenvectors are perpendicular to each other. Typically we use the orthogonal basis of our standard x, y, and z, Cartesian coordinates, which allows us, if we combine them linearly, to represent any point in a 3D space. But any three orthogonal vectors can do the same. We will see this property is used in PCA to identify the dimensions of greatest variation in our data when we discuss dimensionality reduction.

ANSWER

In [ ]:
eigenvalues, eigenvectors = np.linalg.eig(A)
print(f"The eigenvalues for A are: {[x for x in eigenvalues]}.")
print(
    f"The eigenvectors for A are: {[eigenvectors[:, i] for i in range(eigenvectors.shape[1])]}")
The eigenvalues for A are: [11.344814282762082, -0.5157294715892574, 0.1709151888271788].
The eigenvectors for A are: [array([-0.32798528, -0.59100905, -0.73697623]), array([-0.73697623, -0.32798528,  0.59100905]), array([ 0.59100905, -0.73697623,  0.32798528])]
In [ ]:
print(
    f"Av = {np.matmul(A, eigenvectors[:, 1])}, and (lambda)v = {eigenvalues[1]*eigenvectors[:, 1]}.  This confirms this is an eigenvalue, eigenvector pair for A.")
Av = [ 0.38008036  0.16915167 -0.30480078], and (lambda)v = [ 0.38008036  0.16915167 -0.30480078].  This confirms this is an eigenvalue, eigenvector pair for A.
In [ ]:
print(
    f"The inner products of each pair of eigenvectors are: {np.dot(eigenvectors[:, 0], eigenvectors[:,1]): .2f}, {np.dot(eigenvectors[:, 0], eigenvectors[:,2]): .2f}, {np.dot(eigenvectors[:, 1], eigenvectors[:,2]): .2f}, which confirms the eigenvectors are orthogonal to each other.")
The inner products of each pair of eigenvectors are: -0.00, -0.00, -0.00, which confirms the eigenvectors are orthogonal to each other.
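
The higher-order relationship $\mathbf{A}\mathbf{A}\mathbf{v} = \lambda^2 \mathbf{v}$ and the orthogonality of all three eigenvectors can be checked in a few lines. This sketch uses `np.linalg.eigh`, which assumes a symmetric matrix and returns eigenvalues in ascending order; it is an alternative to the `np.linalg.eig` call above, not a correction of it.

```python
import numpy as np

A = np.array([[1, 2, 3], [2, 4, 5], [3, 5, 6]])
eigenvalues, eigenvectors = np.linalg.eigh(A)   # eigh assumes A is symmetric

v = eigenvectors[:, 0]
lam = eigenvalues[0]
second_order_holds = np.allclose(A @ A @ v, lam**2 * v)

# The columns are orthonormal, so V^T V should be the identity matrix
orthonormal = np.allclose(eigenvectors.T @ eigenvectors, np.eye(3))
print(second_order_holds, orthonormal)
```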

Numerical Programming

8

[10 points] Loading data and gathering insights from a real dataset

In data science, we often need to have a sense of the idiosyncrasies of the data, how they relate to the questions we are trying to answer, and to use that information to help us to determine what approach, such as machine learning, we may need to apply to achieve our goal. This exercise provides practice in exploring a dataset and answering questions that might arise from applications related to the data.

Data. The data for this problem can be found in the data subfolder in the assignments folder on GitHub. The filename is a1_egrid2016.xlsx. This dataset is the Environmental Protection Agency's (EPA) Emissions & Generation Resource Integrated Database (eGRID) containing information about all power plants in the United States, the amount of generation they produce, what fuel they use, the location of the plant, and many more quantities. We'll be using a subset of those data.

The fields we'll be using include:

field description
SEQPLT16 eGRID2016 Plant file sequence number (the index)
PSTATABB Plant state abbreviation
PNAME Plant name
LAT Plant latitude
LON Plant longitude
PLPRMFL Plant primary fuel
CAPFAC Plant capacity factor
NAMEPCAP Plant nameplate capacity (Megawatts MW)
PLNGENAN Plant annual net generation (Megawatt-hours MWh)
PLCO2EQA Plant annual CO2 equivalent emissions (tons)

For more details on the data, you can refer to the eGrid technical documents. For example, you may want to review page 45 and the section "Plant Primary Fuel (PLPRMFL)", which gives the full names of the fuel types including WND for wind, NG for natural gas, BIT for Bituminous coal, etc.

There also are a couple of "gotchas" to watch out for with this dataset:

  • The headers are on the second row and you'll want to ignore the first row (they're more detailed descriptions of the headers).
  • NaN values represent blanks in the data. These will appear regularly in real-world data, so getting experience working with these sorts of missing values will be important.

Your objective. For this dataset, your goal is to answer the following questions about electricity generation in the United States:

(a) Which plant has generated the most energy (measured in MWh)?

(b) What is the name of the northern-most power plant in the United States?

(c) What is the state where the northern-most power plant in the United States is located?

(d) Plot a bar plot showing the amount of energy produced by each fuel type across all plants.

(e) From the plot in (d), which fuel for generation produces the most energy (MWh) in the United States?

ANSWER

In [ ]:
import pandas as pd
data = pd.read_excel('data/a1_egrid2016.xlsx', header=1)
data.head()
Out[ ]:
SEQPLT16 PSTATABB PNAME LAT LON PLPRMFL CAPFAC NAMEPCAP PLNGENAN PLCO2EQA
0 1 AK 7-Mile Ridge Wind Project 63.210689 -143.247156 WND NaN 1.8 NaN NaN
1 2 AK Agrium Kenai Nitrogen Operations 60.673200 -151.378400 NG NaN 21.6 NaN NaN
2 3 AK Alakanuk 62.683300 -164.654400 DFO 0.05326 2.6 1213.001 1049.863
3 4 AK Allison Creek Hydro 61.084444 -146.353333 WAT 0.01547 6.5 881.000 0.000
4 5 AK Ambler 67.087980 -157.856719 DFO 0.13657 1.1 1315.999 1087.881

a)

In [ ]:
print(
    f"The plant that has generated the most energy is {data.loc[data.loc[:, 'PLNGENAN'] == data.loc[:, 'PLNGENAN'].max(), 'PNAME'].values[0]}.")
The plant that has generated the most energy is Palo Verde.

b)

In [ ]:
print(
    f"The name of the northern most power plant in the US is {data.loc[data.loc[:, 'LAT'] == data.loc[:, 'LAT'].max(), 'PNAME'].values[0]}.")
The name of the northern most power plant in the US is Barrow.

c)

In [ ]:
print(
    f"The state where the northern most power plant is located is {data.loc[data.loc[:, 'LAT'] == data.loc[:, 'LAT'].max(), 'PSTATABB'].values[0]}.")
The state where the northern most power plant is located is AK.
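
The boolean-mask lookups in (a) through (c) can also be written with `idxmax`, which returns the index label of the largest value and skips NaN entries by default. The sketch below uses a small made-up table; the plant names and numbers are hypothetical and are not taken from eGRID.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the eGRID columns used above
plants = pd.DataFrame({
    "PNAME": ["Plant A", "Plant B", "Plant C"],
    "PSTATABB": ["AK", "AZ", "NC"],
    "LAT": [71.3, 33.4, 35.9],
    "PLNGENAN": [np.nan, 3.2e7, 1.1e6],
})

top_generator = plants.loc[plants["PLNGENAN"].idxmax(), "PNAME"]
northernmost = plants.loc[plants["LAT"].idxmax(), ["PNAME", "PSTATABB"]]
print(top_generator, northernmost["PNAME"], northernmost["PSTATABB"])
```

Unlike the mask-and-`.values[0]` pattern, `idxmax` returns a single row label, so ties resolve to the first occurrence.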

d) A bar plot showing the amount of energy produced by each fuel type is shown below.

In [ ]:
import altair as alt
alt.Chart(data.loc[:, ['PLPRMFL', 'PLNGENAN']].groupby(
    'PLPRMFL', as_index=False).sum()).mark_bar().encode(x='PLPRMFL', y='PLNGENAN')
Out[ ]:

e) The fuel type that produces the most energy is identified from the same grouped totals.

In [ ]:
fuelData = data.loc[:, ['PLPRMFL', 'PLNGENAN']].groupby(
    'PLPRMFL', as_index=False).sum()
print(
    f"The fuel that generates the most amount of energy in the US is {fuelData.loc[fuelData.loc[:, 'PLNGENAN'] == fuelData.loc[:, 'PLNGENAN'].max(), 'PLPRMFL'].values[0]}.")
The fuel that generates the most amount of energy in the US is NG.
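
The group-then-rank step can be sketched on a toy table. The fuel codes below follow the eGRID abbreviations, but the generation numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical generation figures, one row per plant
gen = pd.DataFrame({
    "PLPRMFL": ["NG", "NG", "WND", "BIT"],
    "PLNGENAN": [100.0, 50.0, 30.0, 120.0],
})

# Total generation per fuel type, largest first
by_fuel = gen.groupby("PLPRMFL")["PLNGENAN"].sum().sort_values(ascending=False)
top_fuel = by_fuel.index[0]
print(by_fuel)
print(f"Top fuel: {top_fuel}")
```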

9

[6 points] Vectorization. When we first learn to code and think about iterating over an array, we often use loops. If implemented correctly, that does the trick. In machine learning, we iterate over so much data that those loops can lead to significant slowdowns if they are not computationally efficient. In Python, vectorizing code and relying on matrix operations with efficient tools like numpy is typically the faster approach. Of course, numpy relies on loops to complete the computation, but this is at a lower level of programming (typically in C), and therefore is much more efficient. This exercise will explore the benefits of vectorization. Since many machine learning techniques rely on matrix operations, it's helpful to begin thinking about implementing algorithms using vector forms.

Begin by creating an array of 10 million random numbers using the numpy random.randn function. Compute the sum of the squares of those random numbers first in a for loop, then using NumPy's dot function to perform an inner (dot) product. Time how long it takes to compute each and report the results. How many times faster is the vectorized code than the for loop approach? (Note - your results may vary from run to run).

Your output should use the print() function as follows (where the # symbols represent your answers, to a reasonable precision of 4-5 significant figures):

Time [sec] (non-vectorized): ######

Time [sec] (vectorized): ######

The vectorized code is ##### times faster than the nonvectorized code

ANSWER

In [ ]:
x = np.random.randn(10_000_000)
In [ ]:
def loopedFunction(x):
    # Accumulate the sum of squares one element at a time
    loopedSum = 0
    for eachNumber in x:
        loopedSum += eachNumber ** 2
    return loopedSum


loopedTime = %timeit -q -o loopedFunction(x)
vectorTime = %timeit -q -o np.dot(x, x)
In [ ]:
print(f"Time [sec] (non-vectorized): {loopedTime.average: .5f}")
print(f"Time [sec] (vectorized): {vectorTime.average: .5f}")
print(
    f"The vectorized code is {loopedTime.average/vectorTime.average: .5f} times faster than the nonvectorized code")
Time [sec] (non-vectorized):  1.91978
Time [sec] (vectorized):  0.00514
The vectorized code is  373.79528 times faster than the nonvectorized code
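
The `%timeit` magic only works inside IPython or Jupyter. Outside a notebook, a comparable single-run measurement can be sketched with the standard library's `time.perf_counter`. The array here is deliberately smaller than 10 million elements to keep the run short, and a single run is noisier than the averages `%timeit` reports.

```python
import time
import numpy as np

x = np.random.default_rng(0).standard_normal(100_000)  # smaller array for a quick run

start = time.perf_counter()
loop_sum = 0.0
for value in x:
    loop_sum += value**2
loop_time = time.perf_counter() - start

start = time.perf_counter()
vector_sum = np.dot(x, x)
vector_time = time.perf_counter() - start

print(f"Time [sec] (non-vectorized): {loop_time:.5f}")
print(f"Time [sec] (vectorized): {vector_time:.5f}")
print(f"Both approaches agree: {np.isclose(loop_sum, vector_sum)}")
```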

10

[10 points] This exercise will walk through some basic numerical programming and probabilistic thinking exercises, two skills which are frequently used in machine learning for answering questions from our data.

  1. Synthesize $n=10^4$ normally distributed data points with mean $\mu=2$ and a standard deviation of $\sigma=1$. Call these observations from a random variable $X$, and call the vector of observations that you generate, $\textbf{x}$.
  2. Calculate the mean and standard deviation of $\textbf{x}$ to validate (1) and provide the result to a precision of four significant figures.
  3. Plot a histogram of the data in $\textbf{x}$ with 30 bins
  4. What is the 90th percentile of $\textbf{x}$? The 90th percentile is the value below which 90% of observations can be found.
  5. What is the 99th percentile of $\textbf{x}$?
  6. Now synthesize $n=10^4$ normally distributed data points with mean $\mu=0$ and a standard deviation of $\sigma=3$. Call these observations from a random variable $Y$, and call the vector of observations that you generate, $\textbf{y}$.
  7. Create a new figure and plot the histogram of the data in $\textbf{y}$ on the same axes with the histogram of $\textbf{x}$, so that both histograms can be seen and compared.
  8. Using the observations from $\textbf{x}$ and $\textbf{y}$, estimate $E[XY]$

ANSWER

  1. Below is a vector of $ n = 10^4 $ normally distributed data points with mean $ \mu = 2 $ and standard deviation $ \sigma = 1 $.
In [ ]:
mu = 2.0
sigma = 1.0
x = np.random.default_rng().normal(mu, sigma, 10_000)
In [ ]:
print(
    f"The mean of x is {np.mean(x): .4f} and the standard deviation of x is {np.std(x, ddof= 1): .4f}.")
The mean of x is  1.9973 and the standard deviation of x is  1.0056.
  3. Below is a plot of a histogram of the data in x with 30 bins.
In [ ]:
import matplotlib.pyplot as plt
plt.hist(x, 30)
plt.show()
In [ ]:
print(f"The 90th percentile of x is {np.percentile(x, 90): .4f}.")
The 90th percentile of x is  3.2830.
In [ ]:
print(f"The 99th percentile of x is {np.percentile(x, 99): .4f}.")
The 99th percentile of x is  4.2812.
  6. Below is a vector of $ n = 10^4 $ normally distributed data points with mean $ \mu = 0 $ and standard deviation $ \sigma = 3 $.
In [ ]:
mu_y = 0.0
sigma_y = 3.0
y = np.random.default_rng().normal(mu_y, sigma_y, 10_000)
  7. Below is a histogram comparing the distribution of x and y.
In [ ]:
plt.hist(x, 30, alpha=.5)
y_plot = plt.hist(y, 30, alpha=.5)
plt.show()
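  8. $E[XY]$ is estimated by the sample mean of the element-wise products $x_i y_i$. Because $X$ and $Y$ are generated independently, the estimate should be close to $E[X]E[Y] = 2 \cdot 0 = 0$.
In [ ]:
print(f"The estimated expected value of XY is {np.mean(x * y): .4f}.")
The estimated expected value of XY is  0.0280.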
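
When estimating $E[XY]$ from samples, the quantity needed is the average of the element-wise products, $\frac{1}{n}\sum_i x_i y_i$. For 1-D arrays, `np.matmul(x, y)` returns the sum $\sum_i x_i y_i$, so it must be divided by $n$ to give the same estimate. A minimal sketch with a fixed seed (the seed is an arbitrary choice for repeatability):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(2.0, 1.0, n)   # X ~ N(2, 1)
y = rng.normal(0.0, 3.0, n)   # Y ~ N(0, 3^2), drawn independently of X

estimate = np.mean(x * y)         # sample mean of the products
from_dot = np.matmul(x, y) / n    # the dot product is the sum, so divide by n
print(f"Estimate of E[XY]: {estimate:.4f}")
print(f"Dot product / n:   {from_dot:.4f}")
```

Since the two variables are independent, both numbers should be near $E[X]E[Y] = 0$, with sampling noise on the order of a few hundredths at this sample size.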

Version Control via Git

11

[4 points] Git is efficient for collaboration, an expectation in industry, and one of the best ways to share results in academia. You can even use some Git repository hosts (e.g. GitHub) to host websites, such as the course website. Data scientists with experience in machine learning are expected to know Git. We will interact with Git repositories (a.k.a. repos) throughout this course, and your project will require the use of Git repos for collaboration.

Complete the Atlassian Git tutorial, specifically the following listed sections. Try each concept that's presented. For this tutorial, instead of using BitBucket as your remote repository host, you may use your preferred platform such as GitHub or Duke's GitLab.

  1. What is version control
  2. What is Git
  3. Install Git
  4. Setting up a repository
  5. Saving changes
  6. Inspecting a repository
  7. Undoing changes
  8. Rewriting history
  9. Syncing
  10. Making a pull request
  11. Using branches
  12. Comparing workflows

I also have created two videos on the topic to help you understand some of these concepts: Git basics and a step-by-step tutorial.

For your answer, affirm that you either completed the tutorials above OR have previous experience with ALL of the concepts above. Confirm this by typing your name below and selecting the situation that applies from the two options in brackets.

ANSWER

I, Nick Carroll, affirm that I have previous experience that covers all the content in this tutorial.


Exploratory Data Analysis

12

[15 points] Here you'll bring together some of the individual skills that you demonstrated above and create a Jupyter notebook based blog post on your exploratory data analysis. Your goal is to identify a question or problem and to work towards solving it or providing additional information or evidence (data) related to it through your data analysis. Below, we walk through a process to follow for your analysis. Additionally, you can find an example of a well-done exploratory data analysis here from past years.

  1. Find a dataset that interests you and relates to a question or problem that you find intriguing.
  2. Describe the dataset, the source of the data, and the reason the dataset was of interest. Include a description of the features, data size, data creator and year of creation (if available), etc. What question are you hoping to answer through exploring the dataset?
  3. Check the data and see if they need to be cleaned: are there missing values? Are there clearly erroneous values? Do two tables need to be merged together? Clean the data so it can be visualized. If the data are clean, state how you know they are clean (what did you check?).
  4. Plot the data, demonstrating interesting features that you discover. Are there any relationships between variables that were surprising or patterns that emerged? Please exercise creativity and curiosity in your plots. You should have at least ~3 plots exploring the data in different ways.
  5. What insights are you able to take away from exploring the data? Is there a reason why analyzing the dataset you chose is particularly interesting or important? Summarize this for a general audience (imagine you're publishing a blog post online) - boil down your findings in a way that is accessible, but still accurate.

Here your analysis will be evaluated based on:

  1. Motivation: was the purpose of the choice of data clearly articulated? Why was the dataset chosen and what was the goal of the analysis?
  2. Data cleaning: were any issues with the data investigated and, if found, were they resolved?
  3. Quality of data exploration: were at least 4 unique plots (minimum) included and did those plots demonstrate interesting aspects of the data? Was there a clear purpose and takeaway from EACH plot?
  4. Interpretation: Were the insights revealed through the analysis and their potential implications clearly explained? Was there an overall conclusion to the analysis?

ANSWER

  1. The dataset I analyzed was the daily stock price data of Microsoft from 3/13/1986 (the date of their IPO) to 12/16/2019 (the date the dataset was uploaded to Kaggle). This dataset was taken from the larger dataset on Kaggle: Stock Market Data (Nasdaq, NYSE, S&P500) uploaded by Paul Mooney. It can be found at https://www.kaggle.com/datasets/paultimothymooney/stock-market-data. The dataset was chosen because I am invested in Microsoft, and I want to understand what insights can be drawn from historical pricing data. The dataset consists of 8,518 rows and 7 columns, which hold the date and the pricing information for each of the 8,518 dates. Ideally, the goal is to understand to what certainty prices can be predicted from historical pricing. The pricing information consists of the daily opening price, high price, low price, closing price, adjusted closing price (which is adjusted to account for splits and dividends), and the volume of trades.
  2. There were no missing values in the data; however, to understand whether any days of information were missing, I checked which days of the week had data and compared each date to the previous date, analyzing the number of days between dates. The first histogram below shows the day of the week of each observation. The second histogram shows the number of days between consecutive dates. While the distribution appears approximately as expected, what is surprising is that there are three dates which are more than 4 days after the previous trading day. Upon further review, those days are 9/17/2001 (the first day of trading after the stock market closed for the 9/11 attacks), 1/3/2007 (the stock market was closed on 1/2/2007 for a day of mourning for President Ford), and 10/31/2012 (the first day of trading after the stock market was closed due to Hurricane Sandy). Therefore, there appears to be no missing information.
In [ ]:
data = pd.read_csv('data/MSFT.csv', parse_dates=['Date'])

(data.loc[:, 'Date'].dt.dayofweek.value_counts().sort_index().rename(
    {0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday', 4: 'Friday'})).plot(kind='bar', title='Histogram of MSFT Stock Price Data by Day of the Week', xlabel='Day of the Week', ylabel='Frequency')
plt.show()
In [ ]:
(data.loc[:, 'Date'] - data.loc[:, 'Date'].shift()).value_counts().sort_index().plot(kind='bar',
                                                                                     title='Histogram of Number of Days between Consecutive Trading Days in Microsoft Stock Price Data', xlabel='Number of Days', ylabel='Frequency')
plt.show()

The table below shows the dates with large gaps since the previous trading day.

In [ ]:
# Select rows that fall more than 4 days after the previous trading day
data.loc[data.loc[:, 'Date'].diff() > pd.Timedelta(days=4), :]
Out[ ]:
Date Open High Low Close Adj Close Volume
3916 2001-09-17 27.010000 27.549999 26.4 26.455000 16.628422 127502000
5249 2007-01-03 29.910000 30.250000 29.4 29.860001 21.734066 76935100
6717 2012-10-31 28.549999 28.879999 28.5 28.540001 23.456957 69464100
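
The gap check can be illustrated on a handful of dates. The sketch below uses four real calendar dates around the September 2001 market closure, but the series itself is a toy example rather than the MSFT data. `Series.diff()` gives the time since the previous row, and comparing against `pd.Timedelta(days=4)` flags the unusually long gaps.

```python
import pandas as pd

# Toy series of trading dates around the September 2001 closure
dates = pd.Series(
    pd.to_datetime(["2001-09-07", "2001-09-10", "2001-09-17", "2001-09-18"]),
    name="Date",
)

gaps = dates.diff()                              # time since the previous trading day
long_gaps = dates[gaps > pd.Timedelta(days=4)]   # the first row's NaT compares as False
flagged = list(long_gaps.dt.strftime("%Y-%m-%d"))
print(flagged)
```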

The first chart below shows the closing prices of Microsoft stock over time. Microsoft's stock price appears to grow exponentially over the long run, but fell sharply after the year 2000. The second chart shows that there was a second wave of dramatic growth following 2012 (the charts needed to be separated due to the number of data points plotted). Over short periods of time, there is a lot of noise in the stock price.

In [ ]:
alt.Chart(data.iloc[:5000], title='Microsoft Stock Price over Time').mark_line(
).encode(x='Date', y=alt.Y('Close', title='Closing Price ($)'))
Out[ ]:
In [ ]:
alt.Chart(data.iloc[data.shape[0]-5000:], title='Microsoft Stock Price over Time').mark_line(
).encode(x='Date', y=alt.Y('Close', title='Closing Price ($)'))
Out[ ]:
In [ ]:
print(f"The average closing stock price for Microsoft over this time period is ${data.loc[:, 'Close'].mean(): .2f} and the standard deviation of their stock price over this period is ${data.loc[:, 'Close'].std(): .2f}.")
print(f"Microsoft's maximum price was ${data.loc[:, 'High'].max(): .2f}, this was reached on {np.datetime_as_string(data.loc[data.loc[:, 'High'] == data.loc[:, 'High'].max(), 'Date'].values[0], unit = 'D')}.")
The average closing stock price for Microsoft over this time period is $ 28.12 and the standard deviation of their stock price over this period is $ 28.39.
Microsoft's maximum price was $ 158.73, this was reached on 2019-12-26.
In [ ]:
log_data = data.copy()
log_data.loc[:, "log_close"] = np.log(log_data.loc[:, 'Close'])

The charts below show the log of the stock price over time, which appears to follow a more linear relationship than the previous charts of stock price over time. There appears to be one fairly consistent slope prior to 2000, and another slope of approximately 0 after 2000, before a third slope forms after 2012.

In [ ]:
alt.Chart(log_data.iloc[:5000], title='Log of Microsoft Stock Price Data over Time').mark_line(
).encode(x='Date', y=alt.Y('log_close', title='log of Closing Price (log($))'))
Out[ ]:
In [ ]:
alt.Chart(log_data.iloc[data.shape[0]-5000:], title='Log of Microsoft Stock Price Data over Time').mark_line(
).encode(x='Date', y=alt.Y('log_close', title='log of Closing Price (log($))'))
Out[ ]:

Below is a histogram showing the daily percent change in Microsoft's stock price, expressed as a fraction of the previous day's close. It shows that the mode is positive, but very small (< 1%).

In [ ]:
log_data.loc[:, 'Percent Change'] = log_data.loc[:, 'Close'] / log_data.loc[:, 'Close'].shift() - 1
log_data.loc[:, 'Percent Change'].hist(bins = 30)
plt.title('Histogram of Daily Percent Change in Price of Microsoft Stock')
plt.xlabel('Daily Percent Change in Price of Microsoft Stock')
plt.ylabel('Frequency')
plt.show()
In [ ]:
print(f"The average percent daily change of Microsoft's stock price over this period is {log_data.loc[:, 'Percent Change'].mean() * 100: .2f}%.")
print(f"The standard deviation of the percent daily change of Microsoft's stock price over this period is {log_data.loc[:, 'Percent Change'].std() * 100: .2f}%.")
print(f"The largest daily percent drop in Microsoft's stock price was {log_data.loc[:, 'Percent Change'].min() * 100: .2f}%, which occurred on {np.datetime_as_string(log_data.loc[log_data.loc[:, 'Percent Change'] == log_data.loc[:, 'Percent Change'].min(), 'Date'].values[0], unit = 'D')}.")
The average percent daily change of Microsoft's stock price over this period is  0.11%.
The standard deviation of the percent daily change of Microsoft's stock price over this period is  2.13%.
The largest daily percent drop in Microsoft's stock price was -30.12%, which occurred on 1987-10-19.
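
The daily change used here is the ratio of each close to the previous close, minus 1. pandas provides `Series.pct_change()` for the same calculation. The sketch below compares the two on a few made-up prices, which are hypothetical and not MSFT values.

```python
import numpy as np
import pandas as pd

close = pd.Series([100.0, 102.0, 96.9, 96.9])   # hypothetical closing prices

manual = close / close.shift() - 1   # ratio to the previous close, minus 1
builtin = close.pct_change()         # pandas equivalent
same = np.allclose(manual[1:], builtin[1:])
print(builtin.round(4).tolist())
print(f"Manual and built-in results match: {same}")
```

The first entry is NaN in both versions because the first day has no previous close to compare against.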

This analysis has shown that Microsoft's stock price has had multiple phases of growth, some rapid, but with a plateau between the dot-com bubble and the mortgage-backed security bubble. Up to the most current data in this dataset, Microsoft's stock price appears to continue growing, but the potential of another bubble could provide investors with another long period of minimal returns.